More efficient HellaSwag implementation #2677
Conversation
Tested with OpenLLaMA 3bv2-q8_0. We might get a speedup and similar scores as master by using a whole task in one ctx window and inserting BOS/EOS between the full queries. Something like …
Hm, not sure. I tested with LLaMA-2 7B and the result matches 100% for the 800 tasks I checked. At 400 tasks this PR and master are both 0.25 higher compared to what is posted in the table in #2321. My hypothesis was that perhaps this is due to …
It may be something with the OpenLLaMA models. I will run a test using LLaMA-2 7B.
Here is what I get with master and this PR using OpenLLaMA-3B and fp16. You only see one curve because the results of both runs are identical. Here are the first 100 tasks on master:
ml-f16.bin -t 1 -ngl 100 --hellaswag-tasks 10042
system_info: n_threads = 1 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
task  acc_norm
…
Yes, my bad. I was comparing the PR against a run I did when the HellaSwag evaluation was first implemented. This means that some changes since then have lowered the scores that much!
How does the 3-shot work? By using …
I tried a bunch of different stuff. The best approach so far is illustrated by the first task of the dataset: the first 3 query+response pairs surrounded by …
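A rough C++ sketch of how a few-shot prompt of this kind could be assembled. The `hs_task` struct, the `build_few_shot_prompt` helper, and the exact placement of separators are assumptions made for this illustration, not the code from this PR:

```cpp
#include <string>
#include <vector>

// Hypothetical representation of one HellaSwag task (not the PR's actual struct).
struct hs_task {
    std::string              context;  // the query/context
    std::vector<std::string> endings;  // the 4 candidate endings
    int                      gold;     // index of the correct ending
};

// Concatenate n_shot solved examples followed by the query of the task to score.
// Where exactly BOS/EOS tokens go is an assumption; the comment above only hints at it.
static std::string build_few_shot_prompt(const std::vector<hs_task> & shots,
                                         const std::string & query,
                                         int n_shot) {
    std::string prompt;
    for (int i = 0; i < n_shot && i < (int) shots.size(); ++i) {
        prompt += shots[i].context + " " + shots[i].endings[shots[i].gold] + "\n\n";
    }
    prompt += query;  // each candidate ending is then appended and scored separately
    return prompt;
}
```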
This could happen if you try to pass more tokens than …
In other words, I need to split into batches if the number of tokens is greater than …
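A minimal sketch of that splitting, assuming the `llama_eval()` API that llama.cpp exposed at the time (token pointer, token count, `n_past`, thread count); the `eval_tokens` helper name is made up for this example:

```cpp
#include "llama.h"

#include <algorithm>
#include <cstddef>
#include <vector>

// Evaluate `tokens` in chunks of at most n_batch tokens, advancing n_past so
// the KV cache keeps accumulating across chunks.
static bool eval_tokens(llama_context * ctx, const std::vector<llama_token> & tokens,
                        int n_batch, int n_threads, int & n_past) {
    for (size_t i = 0; i < tokens.size(); i += n_batch) {
        const int n_eval = std::min((int) (tokens.size() - i), n_batch);
        if (llama_eval(ctx, tokens.data() + i, n_eval, n_past, n_threads) != 0) {
            return false;  // evaluation failed
        }
        n_past += n_eval;
    }
    return true;
}
```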
Set the …
Yes, that's correct
This looks like CUDA, which still uses the scratch buffers; these have a maximum batch size of 512. With the other backends, you should be able to use any batch size. However, keep in mind that the command line parsing at …
@slaren Yes, I'm running on CUDA (doing HellaSwag on the CPU is a hopeless undertaking). But is it really the CUDA backend? The error is raised in …
Are you saying you do the evaluation by first "steering" the model using the training dataset? |
If this is how you want to see it, sure. The way I see it, this is basically my interpretation of what few-shot evaluation is. |
So the problem is that due to some limitations, the CUDA backend still uses ggml scratch buffers for memory allocation, while the other backends have already moved to the graph allocator. The size of the scratch buffers was determined manually for a batch size of 512, while the graph allocator calculates the size of buffers dynamically and therefore can use any batch size. When using the graph allocator, the compute context is …
Yes, but I do not think the training dataset should be used at all when testing. Have you tested using only data from the validation dataset? |
So, I guess the long-term solution would be to move the CUDA backend to the graph allocator. But what is a quick fix? Just play around with …
That should work. However, I think that the proper solution would be to respect …
Maybe test this PR in the GGUF branch. The perplexity is slightly lower with the changes there, so hellaswag scores could be higher. |




Instead of evaluating context + ending for the 4 endings, we evaluate context + ending for the 1st ending, and then just the ending using `n_past = context length` for the remaining 3. This gives a disappointing ~10% speedup for the `hellaswag_val_full.txt` data (see #2321). Initially I tried first evaluating just the context, and then running the 4 endings with `n_past = context length`, but this was ~10% slower.

The efficiency gain is much more significant for a few-shot HellaSwag evaluation, where one adds additional examples from a training dataset to the context. For instance, with 3 additional examples, this PR runs nearly 2X faster compared to master (I could not go beyond 3 because I'm running into a `ggml_new_object: not enough space in the context's memory pool` error despite using `-c 1024` and the context being just 647 tokens when it fails).
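A rough sketch of the idea, again assuming the `llama_eval()` API of that era (this is not the PR's exact code): evaluate the context together with the first ending once, then for the remaining endings evaluate only the ending tokens with `n_past` set to the context length, so the cached context is reused.

```cpp
#include "llama.h"

#include <vector>

// tok_context: tokens of the (possibly few-shot) context
// tok_endings: the tokenized candidate endings (4 for HellaSwag)
static void eval_task(llama_context * ctx,
                      const std::vector<llama_token>              & tok_context,
                      const std::vector<std::vector<llama_token>> & tok_endings,
                      int n_threads) {
    const int n_context = (int) tok_context.size();

    // 1st ending: evaluate context + ending together; this also fills the
    // KV cache with the context tokens at positions [0, n_context).
    std::vector<llama_token> first = tok_context;
    first.insert(first.end(), tok_endings[0].begin(), tok_endings[0].end());
    llama_eval(ctx, first.data(), (int) first.size(), 0, n_threads);
    // ... accumulate log-probabilities for ending 0 from the logits here ...

    // Remaining endings: evaluate only the ending tokens with n_past set to the
    // context length, overwriting the KV cache entries of the previous ending.
    for (size_t i = 1; i < tok_endings.size(); ++i) {
        llama_eval(ctx, tok_endings[i].data(), (int) tok_endings[i].size(), n_context, n_threads);
        // ... accumulate log-probabilities for ending i here ...
    }
}
```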